League of Legends Summer Tournament Data Analysis

By William Weihnacht and Daniel Song

intro2.jpeg

Introduction

In recent years, League of Legends has grown not only in popularity as the most played e-sports title but also as a venue for data analytics. League of Legends is arguably the most played video game on PC and the most watched e-sport on the planet, with daily viewership in the hundreds of thousands on Twitch. As a video game, it has intricately complex dynamics: two teams of five players each compete in a part arms race, part battle to eliminate one another’s towers and central Nexus. Each game is like a mini-war that usually resolves within 30 minutes to an hour, depending on how quickly one team gains an advantage over the other. League of Legends currently features 140+ distinct champions (avatars), each with unique abilities and properties that yield different advantages and disadvantages, and Riot Games (the creator of League of Legends) consistently churns out new champions every few months, resulting in a constantly shifting meta. Each base is protected by a set of defensive towers that shoot at the first enemy they come across. The Nexus, or command center, of each base generates AI-controlled minions that serve as foot soldiers; opponents kill them to amass the gold and experience points used to strengthen their abilities. The game ends when one of the Nexuses is destroyed. For more information on the rules click here.

The dynamics of the game become quite complicated as all these factors stack together, so analyzing it can be complicated as well. However, since the game has arguably remained the figurehead of e-sports over the 10 or so years since its release, its size attracts a lot of scrutiny of matchup statistics and match analysis. In our project, we hope to shed some light on the driving forces of the current meta as well as model the likelihood of a team winning given its current resources.

elixir.jpg

Data collection

To collect the data for our project, we were lucky that there is a plethora of League-related data widely accessible, including Riot’s own match data API. However, instead of scraping data from random public matches, we chose to collect data from an existing professional League analysis website named Oracle’s Elixir, so we could analyze the game at the highest level of play and in the current meta. The main challenge, then, was reformatting the data into the shape we wanted.

Oracle's Elixir is the No. 1 LoL Esports Statistics + Analytics Website founded by Tim Sevenhuysen.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api
In [4]:
#read in dataset
data = pd.read_excel('./2019-summer-match-data-OraclesElixir-2019-06-24.xlsx')

To start off, there is much more data than we need to get a good picture of the current metagame, so we can go ahead and select the most relevant 45 of the 98 columns.

In [5]:
#select relevant columns
less_data = data[['date','gameid','playerid','player','team','side','gamelength','result','k','d','a','teamkills','teamdeaths','fb','kpm','okpm','fd','fdtime','teamdragkills','oppdragkills','ft','firstmidouter','firsttothreetowers','teamtowerkills','opptowerkills','fbaron','teambaronkills','oppbaronkills','dmgtochampsperminute','wpm','wcpm','totalgold','goldspent','csat10','oppcsat10','csdat10','goldat10','oppgoldat10','gdat10','goldat15','oppgoldat15','gdat15','xpat10','oppxpat10','xpdat10']]

Data Processing

Now that we have a DataFrame full of statistics we can start to break it down into more relevant subsets. We want to look at performance based on individual champions, as well as what the most successful teams are prioritizing doing.

To get clean data for team stats we simply select all rows where the player is listed as "Team", which means the row contains the team-level values for that game. Once we have these rows, we group together all rows that correspond to the same team and average them to get a table of team averages across all games this tournament. Doing so makes the variables playerid, teamkills, and teamdeaths redundant, so they are dropped.

All our data is grouped by teams. As a result, the different teams in our dataset serve as the index for the dataframe. This will make a difference when looking at distributions of win percentages in the machine learning section, where the distribution will show the average number of teams on the y-axis and the win percentage on the x-axis.

In [6]:
#team stats for each game
team_stats = less_data.loc[less_data['player'] == 'Team']
In [7]:
#team data averaged across all games
avg_team_data = team_stats.groupby(['team']).mean()
avg_team_data.drop(['playerid','teamkills','teamdeaths'], axis=1, inplace=True)
avg_team_data.head()
Out[7]:
gamelength result k d a kpm okpm fdtime teamdragkills oppdragkills ... csdat10 goldat10 oppgoldat10 gdat10 goldat15 oppgoldat15 gdat15 xpat10 oppxpat10 xpdat10
team
100 Thieves 34.847917 0.375000 9.625000 12.125000 26.000000 0.271693 0.356500 10.055717 1.250000 3.250000 ... -33.250000 14695.750000 16037.625000 -1341.875000 23126.125000 25380.750000 -2254.625000 18223.000000 18705.375000 -482.375000
AHQ e-Sports Club 36.662500 0.625000 12.625000 9.125000 29.500000 0.366591 0.250843 8.835340 3.000000 2.250000 ... -0.625000 15873.500000 15221.000000 652.500000 25036.250000 23743.625000 1292.625000 18590.375000 18360.500000 229.875000
Afreeca Freecs 33.570833 0.500000 12.312500 12.312500 28.750000 0.373947 0.366536 9.033907 2.437500 2.375000 ... 14.937500 15656.125000 15645.375000 10.750000 24777.562500 24646.750000 130.812500 18476.437500 18545.875000 -69.437500
Alpha Esports 29.338095 0.428571 9.000000 11.142857 23.142857 0.302010 0.386755 7.372802 1.571429 2.285714 ... -13.857143 15398.142857 15415.428571 -17.285714 23850.142857 24148.714286 -298.571429 18245.571429 19069.428571 -823.857143
Bilibili Gaming 31.320833 0.500000 11.416667 13.000000 29.000000 0.365807 0.445617 NaN 1.416667 2.416667 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 35 columns

Next, we look at individual player data with the goal of finding which champions are used to the greatest effect. Some missing data is listed as either an empty string or a single space instead of "not a number", so we change both to NaN and then drop them all at once.

In [8]:
#collect average champion data; .copy() avoids SettingWithCopyWarning on the later in-place edits
champ_data = data[['champion','position','result','earnedgoldshare','dmgshare','wardshare']].copy()
champ_data.replace('', np.nan, inplace=True)
champ_data.replace(' ', np.nan, inplace=True)
champ_data.dropna(inplace=True)
champ_data.head()
Out[8]:
champion position result earnedgoldshare dmgshare wardshare
0 Aatrox Top 1 0.235685 0.273088 0.201835
1 Hecarim Jungle 1 0.176841 0.123226 0.100917
2 Sylas Middle 1 0.254719 0.255993 0.091743
3 Varus ADC 1 0.224666 0.253611 0.110092
4 Galio Support 1 0.108088 0.094082 0.495413

Initial Analysis + Data Visualization

map.jpg

calculate side vs win

First off, I have seen theories that teams wearing a certain color have a slight advantage in various sports and video games, so I thought it would be interesting to check here. I was impressed to find that blue teams have won 55.6% of the time. After some research, I found that this is because the blue team gets to pick the first champion, giving it a statistical edge in pro play.

In [9]:
#calculate side vs win%
r = 0
b = 0

for index,row in team_stats.iterrows():
    if row['side'] == 'Red' and row['result'] == 1:
        r += 1
    elif row['side'] == 'Blue' and row['result'] == 1:
        b += 1
        
plt.pie([r,b],labels=['Red','Blue'],colors=['Red','Blue'],autopct='%1.1f%%')
plt.title("Win rate by side")
plt.show()
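As an aside, the same win counts can be computed without an explicit loop using pd.crosstab. A minimal sketch, using a toy frame standing in for team_stats:

```python
import pandas as pd

# toy stand-in for team_stats with just the columns the tally needs
team_stats = pd.DataFrame({
    'side':   ['Blue', 'Red', 'Blue', 'Red', 'Blue', 'Red'],
    'result': [1, 0, 1, 0, 0, 1],
})

# cross-tabulate side against result; column 1 holds the win counts per side
wins_by_side = pd.crosstab(team_stats['side'], team_stats['result'])[1]
print(wins_by_side['Blue'], wins_by_side['Red'])  # → 2 1
```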

calculate first to 3 towers vs win, first tower vs win

zappp-1024x468.jpg

I now look at the variable in the data that is likely the biggest indicator of winning the match: first to three towers. Since three is the minimum number of towers you must destroy before you can attack the main objective, it makes sense that this would be a big indicator of success. We can see that "first to three towers and won" accounts for almost 40% of all games.

In [10]:
#calculate first to 3 towers vs win%
f_w = 0
f_l = 0
nf_w = 0
nf_l = 0

for index,row in team_stats.iterrows():
    if row['firsttothreetowers'] == 1 and row['result'] == 1:
        f_w += 1
    elif row['firsttothreetowers'] == 1 and row['result'] == 0:
        f_l += 1
    elif row['firsttothreetowers'] == 0 and row['result'] == 1:
        nf_w += 1
    elif row['firsttothreetowers'] == 0 and row['result'] == 0:
        nf_l += 1
        
plt.pie([f_w,nf_w,nf_l,f_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first to three towers")
plt.show()
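Note that the pie slices show shares of all games rather than conditional probabilities; the win rate given being first to three towers falls out of a single groupby. A minimal sketch with toy data standing in for team_stats:

```python
import pandas as pd

# toy stand-in for team_stats (1 = first to three towers / won the game)
team_stats = pd.DataFrame({
    'firsttothreetowers': [1, 1, 1, 0, 0, 1],
    'result':             [1, 1, 0, 0, 1, 1],
})

# P(win | first to three towers) vs P(win | not first)
cond = team_stats.groupby('firsttothreetowers')['result'].mean()
print(cond[1], cond[0])  # → 0.75 0.5
```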

Based on this, it makes sense that the first tower gives slightly lower but still significantly increased odds of winning. It’s interesting that the first tower gives an 8.7% bump, while getting the next two first increases your chance of winning by only another 5.7%. This is a testament to the fact that resources can be used to “snowball” an advantage, and their value diminishes as time goes on.

In [11]:
#calculate first tower vs win%
ft_w = 0
ft_l = 0
nft_w = 0
nft_l = 0

for index,row in team_stats.iterrows():
    if row['ft'] == 1 and row['result'] == 1:
        ft_w += 1
    elif row['ft'] == 1 and row['result'] == 0:
        ft_l += 1
    elif row['ft'] == 0 and row['result'] == 1:
        nft_w += 1
    elif row['ft'] == 0 and row['result'] == 0:
        nft_l += 1
        
plt.pie([ft_w,nft_w,nft_l,ft_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first tower")
plt.show()

calculate first blood vs win

firstblood.jpg

Similarly, first blood gives only a 3.6% increased chance of winning, which is reasonable because it nets you only 400 gold, as opposed to 1000 for a tower or 300 for a normal kill.

In [12]:
#calculate first blood vs win%
fb_w = 0
fb_l = 0
nfb_w = 0
nfb_l = 0

for index,row in team_stats.iterrows():
    if row['fb'] == 1 and row['result'] == 1:
        fb_w += 1
    elif row['fb'] == 1 and row['result'] == 0:
        fb_l += 1
    elif row['fb'] == 0 and row['result'] == 1:
        nfb_w += 1
    elif row['fb'] == 0 and row['result'] == 0:
        nfb_l += 1
        
plt.pie([fb_w,nfb_w,nfb_l,fb_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first blood")
plt.show()

calculate first dragon vs win

gi-modes-sr-the-dragon.jpg

Looking at being first to kill a dragon: it is more significant than first blood but less so than first tower. Dragons are important for building advantages since their buff lasts the entire game, but they don’t give the same gold and lane-control advantage as a tower.

In [13]:
#calculate first dragon vs win%
fd_w = 0
fd_l = 0
nfd_w = 0
nfd_l = 0

for index,row in team_stats.iterrows():
    if row['fd'] == 1 and row['result'] == 1:
        fd_w += 1
    elif row['fd'] == 1 and row['result'] == 0:
        fd_l += 1
    elif row['fd'] == 0 and row['result'] == 1:
        nfd_w += 1
    elif row['fd'] == 0 and row['result'] == 0:
        nfd_l += 1
        
plt.pie([fd_w,nfd_w,nfd_l,fd_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first dragon")
plt.show()

individual role/champion data

Summoners-Rift-map-layout-basic.jpg

Next, we will look at individual role/champion data. First, I count the number of times each champion was played at a certain position and remove champions who were not used more than 10 times in that position. Since we are dealing with averages, outliers such as a champion who was used once at a position and won would skew the data.

The Support and ADC (usually a ranged champion) operate in the bottom lane. The remaining roles are named after the lanes/areas they operate in: Top operates in the top lane, Mid operates in the middle lane, and Jungle roams to any lane to support his/her teammates but mainly operates in the jungle (the forested areas between the lanes).

Tidying champion data

In [14]:
#count each time a champion was picked for each position
pick_counts = champ_data.groupby(['position','champion']).size()
In [15]:
#remove data for champions used 10 or fewer times at a given position
champ_data2 = []
for index,row in champ_data.iterrows():
    if pick_counts[row['position']][row['champion']] > 10:
        champ_data2.append(row)
        
champ_data2 = pd.DataFrame(champ_data2)
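The iterrows() filter above can also be expressed with a vectorized groupby transform. A sketch on a toy frame shaped like champ_data:

```python
import pandas as pd

# toy stand-in for champ_data: one pick above the cutoff, one below
champ_data = pd.DataFrame({
    'position': ['Top'] * 12 + ['Mid'] * 3,
    'champion': ['Sylas'] * 12 + ['Akali'] * 3,
    'result':   [1, 0] * 6 + [1, 1, 0],
})

# count how often each (position, champion) pair appears, broadcast back
# onto every row, then keep only pairs seen more than 10 times
counts = champ_data.groupby(['position', 'champion'])['result'].transform('size')
champ_data2 = champ_data[counts > 10]
print(len(champ_data2))  # → 12
```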

Here I group the data by position and champion and average it, rename result to "Win Percent" for clarity, split the DataFrame by position played, and write a little function to re-index each position's table cleanly for graphing.

In [16]:
#Average the data
avg_position_data = champ_data2.groupby(['position','champion']).mean()
avg_position_data.rename({'result':'Win Percent'},inplace=True,axis=1)
In [17]:
#Split the data by position
top,mid,adc,jg,sup = [],[],[],[],[]
for index,row in avg_position_data.iterrows():
    if index[0] == "ADC":
        adc.append(row)
    if index[0] == "Jungle":
        jg.append(row)
    if index[0] == "Middle":
        mid.append(row)
    if index[0] == "Support":
        sup.append(row)
    if index[0] == "Top":
        top.append(row)
In [18]:
#function to clean the data for graphing
def reindex(pos):
    df = pd.DataFrame(pos)
    i = []
    for index,row in df.iterrows():
        i.append(index[1])
    df.index = i
    return df
In [19]:
#applying the function to the dataframes created
top = reindex(top)
mid = reindex(mid)
adc = reindex(adc)
jg = reindex(jg)
sup = reindex(sup)
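As an alternative to the loop-and-reindex approach, each position's table can be pulled straight out of the multi-indexed frame with DataFrame.xs, which selects one level and drops it. A sketch on a toy stand-in for avg_position_data:

```python
import pandas as pd

# toy multi-indexed frame shaped like avg_position_data
avg_position_data = pd.DataFrame(
    {'Win Percent': [0.6, 0.5, 0.55]},
    index=pd.MultiIndex.from_tuples(
        [('Top', 'Sylas'), ('Top', 'Akali'), ('Middle', 'Yasuo')],
        names=['position', 'champion'],
    ),
)

# .xs keeps only the rows for one position and leaves champions as the index
top = avg_position_data.xs('Top', level='position')
print(list(top.index))  # → ['Sylas', 'Akali']
```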

Champion stats for the position top

In the top position the big winner that jumps out is Sylas, with a win percentage of over 60% across 24 games. Sylas was a strong mid-laner in patch 9.1, and it is a testament to his utility that he is used in the top lane too. Interestingly, the overall damage and gold share of the most successful champions at this position is lower than that of the more middle-of-the-pack ones, which makes sense in a way: top is an isolated position, and one can have success in one's own lane even while the team loses control of the other lanes.

In [20]:
top.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()

Champion stats for the position middle

In the middle role we see that Sylas is dominant here as well, on par with Yasuo, another strong pick. There is a lot of variety in this position and some overlap with top, as they have similar roles. Akali, for example, is used in both but to more success in middle than top, likely because she is more of a ganker (ganking refers to ambushing an enemy in a different lane). Once again some of the lower win-rate champions have good damage and gold share, which can potentially be attributed to the bottom lane failing to overpower the enemy and being pushed in, leading to proportionally more success from the more individual middle and top lanes.

In [21]:
mid.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()

Champion stats for the position ADC

The ADC role appears to have more of a set meta, with only 8 commonly used champions as opposed to the 15 used at middle. ADC, which stands for attack damage carry, is tasked with outputting the majority of team damage towards the end of the game. While Kalista has by far the highest win rate, this is a little deceiving as she was used only 11 times, compared to 92 for Sivir and 90 for Xayah. With a respectable win rate of 55% over 28 games, Varus seems to be one of the stronger damage dealers at this position, while the supposedly meta Xayah and Sivir saw mediocre win rates.

In [22]:
adc.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()

Champion stats for the position jungle

Junglers are tasked with killing their own jungle spawns, stealing the enemy's, and ganking. Once again, Karthus has a deceptively high win rate given he was played in only 11 games; he is an effective damage dealer but lacks the utility of other champions. Trundle is also effective: although he lacks damage, he provides utility by hindering enemies in gank situations to secure extra kills, accounting for his low damage share yet high success.

In [23]:
jg.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()

Champion stats for the position support

The support position is tasked with vision control and enabling the ADC and other roles through buffs. Once again, the highest win percentage, Pyke's, comes from a smaller sample size. However, unlike in some other roles, two of the most played champions are also among the most successful: Rakan is strong as a standard healer, while Tahm Kench has an ultimate that provides mobility to teammates, plus a stun.

In [24]:
sup.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()

Overall, there are some champions, such as Sylas and Varus, that are used frequently and yield a high win rate. However, it seems that the most successful picks are often situational: while not outright the strongest at their position, they can be slotted into a team that synergizes with them, or serve as a counter pick that takes advantage of the other team's weaknesses. This is a testament to the fluid nature of the game: while some champions are strong overall at certain points in time, there is generally room for counterplay, meaning the game is well designed and balanced, and ingenuity will prevail over spamming "meta picks".

In [25]:
from sklearn import model_selection, linear_model, preprocessing, metrics, datasets
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn.preprocessing import LabelEncoder
from scipy.stats import f
import statsmodels.formula.api as smf
from mpl_toolkits import mplot3d
import warnings
warnings.filterwarnings('ignore')

league-of-legends-champions.jpeg

Supervised Machine Learning [The Metagame]

We did supervised machine learning on seven groups of gameplay factors to determine win percentage:

  1. champion kills
    • kpm = kills per minute
    • teamkills = champion kills made by team
    • a = assists, assisted kills
  2. not dying
    • teamdeaths = number of total deaths by team
  3. large monster kills
    • teamdragkills = Total dragons killed by team.
    • herald = Rift herald taken (1 yes, 0 opponent took it, blank herald not killed).
    • fbaron = First baron of game killed (1 yes, 0 no).
    • teambaronkills = Total barons killed by team.
  4. tower kills
    • ft = First tower of game killed (1 yes, 0 no).
    • firstmidouter = First team to kill mid lane outer tower (1 yes, 0 no).
    • firsttothreetowers = First team to kill three towers (1 yes, 0 no).
    • teamtowerkills = Total towers killed by team.
    • fttime = First tower kill time, in minutes.
  5. creep score
    • minionkills = Lane minions killed.
    • monsterkills = Neutral monsters killed.
    • cspm = Creep score per minute. All creep score variables include minions and monsters.
    • csat10 = Creep score at 10:00.
    • csdat10 = Creep score difference at 10:00.
  6. vision
    • wards = Total wards placed (of all types).
    • wpm = Wards placed per minute (of all types).
    • wardkills = Total wards cleared/killed (of all types).
    • wcpm = Total wards cleared/killed per minute (of all types).
    • visionwards = Vision/control wards placed.
    • visionwardbuys = Vision/control wards purchased.
  7. economics
    • totalgold = Total gold earned from all sources.
    • earnedgpm = Earned gold per minute.
    • goldspent = Total gold spent.
    • gspd = Gold spent percentage difference.
    • goldat10 = Total gold earned at 10:00.
    • gdat10 = Gold difference at 10:00.
    • goldat15 = Total gold earned at 15:00.
    • gdat15 = Gold difference at 15:00.

Tidy our data for machine learning

As a reminder, all our data is grouped by teams.

As a result, the different teams in our dataset serve as the index for the dataframe. This will make a difference when looking at distributions of win percentages in the machine learning section, where the distribution will show the average number of teams on the y-axis and the win percentage on the x-axis.

Reference for gameplay factors: http://oracleselixir.com/match-data/match-data-dictionary/

In [148]:
by_team_data = data.loc[data['player'] == 'Team'].copy()
by_team_data.fillna(0, inplace=True)
team_data = by_team_data.groupby(['team']).mean()
result = team_data[['result']]
In [149]:
team_data.head()
Out[149]:
week game patchno playerid gamelength result k d a teamkills ... gdat15 xpat10 oppxpat10 xpdat10 csat10 oppcsat10 csdat10 csat15 oppcsat15 csdat15
team
100 Thieves 2.500000 1.625000 9.110000 150.000000 34.847917 0.375000 9.625000 12.125000 26.000000 9.625000 ... -2254.625000 18223.000000 18705.375000 -482.375000 287.375000 320.625000 -33.250000 450.875000 509.000000 -58.1250
AHQ e-Sports Club 1.925000 1.875000 9.117500 150.000000 36.662500 0.625000 12.625000 9.125000 29.500000 12.625000 ... 1292.625000 18590.375000 18360.500000 229.875000 316.375000 317.000000 -0.625000 508.250000 497.250000 11.0000
Afreeca Freecs 2.275000 1.875000 9.113125 150.000000 33.570833 0.500000 12.312500 12.312500 28.750000 12.312500 ... 130.812500 18476.437500 18545.875000 -69.437500 324.437500 309.500000 14.937500 494.500000 481.187500 13.3125
Alpha Esports 1.885714 1.714286 9.117143 157.142857 29.338095 0.428571 9.000000 11.142857 23.142857 9.000000 ... -298.571429 18245.571429 19069.428571 -823.857143 305.285714 319.142857 -13.857143 481.714286 501.714286 -20.0000
Bilibili Gaming 2.916667 1.750000 9.102500 133.333333 31.320833 0.500000 11.416667 13.000000 29.000000 11.416667 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000

5 rows × 71 columns

triple.jpg

1. Win Percentage based on Champion Kills

We pick the three factors that influence "Champion Kills":

- kpm (kills per minute)
- teamkills (total kills by team)
- a (assisted kills by teammates)

With these three factors, we run a linear regression to fit the data.

In [150]:
# champ_kills is a sub-dataframe of team_data based on the Champion Kills factors.
champ_kills = team_data[['kpm','teamkills','a']]
# Set X as champion kills features
X = champ_kills
# set our win_percentage as y
y = result
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [151]:
# run the linear regression on the training set using statsmodel library
champ_kills_LRM = smf.OLS(y_train, X_train).fit()

# Predict win percentage using our regression model on the Test data
preds_champKillsLRM = champ_kills_LRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_champKillsLRM = cross_val_predict(model, X, y, cv=5)

# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Champion Kills')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_champKillsLRM, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_champKillsLRM, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()

Champion Kills Statistics

  • The R-squared value shows that over 90% of the variance is explained by the champion-kill factors. (Note that the model is fit without an intercept, so statsmodels reports uncentered R-squared, which tends to run high.)
  • Of the factors analyzed (kpm, teamkills, a), only kpm (kills per minute) came out as statistically significant, given its low p-value.
In [152]:
champ_kills_LRM.summary()
Out[152]:
OLS Regression Results
Dep. Variable: result R-squared: 0.931
Model: OLS Adj. R-squared: 0.926
Method: Least Squares F-statistic: 188.9
Date: Sat, 20 Jul 2019 Prob (F-statistic): 2.10e-24
Time: 01:14:38 Log-Likelihood: 23.477
No. Observations: 45 AIC: -40.95
Df Residuals: 42 BIC: -35.53
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
kpm 1.4654 0.662 2.212 0.032 0.129 2.802
teamkills -0.0066 0.042 -0.157 0.876 -0.092 0.079
a 0.0019 0.013 0.143 0.887 -0.025 0.029
Omnibus: 2.344 Durbin-Watson: 2.453
Prob(Omnibus): 0.310 Jarque-Bera (JB): 2.158
Skew: -0.453 Prob(JB): 0.340
Kurtosis: 2.425 Cond. No. 931.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

notdying.jpg

2. Win Percentage based on Not Dying

We pick the single factor that influences "Not Dying":

- teamdeaths (the number of times team members have died)

With this single factor, we run a linear regression to fit the data.

In [153]:
# not_dying is a sub-dataframe of team_data based on the Not Dying factor.
not_dying = team_data[['teamdeaths']]
# Set X as not_dying dataframe
X = not_dying

# y is already set as our win_percentage

# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)

# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [154]:
# run the linear regression on the training set using statsmodel library
notDying_LRM = smf.OLS(y_train, X_train).fit()

# Predict win percentage using our regression model on the Test data
preds_NotDying = notDying_LRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_NotDying = cross_val_predict(model, X, y, cv=5)

# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Not Dying')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_NotDying, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_NotDying, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()

Not Dying Statistics

  • The R-squared value shows that only about 70% of the variance is explained by the factor of not dying.
  • teamdeaths came out as statistically significant because of its low p-value.
In [155]:
notDying_LRM.summary()
Out[155]:
OLS Regression Results
Dep. Variable: result R-squared: 0.732
Model: OLS Adj. R-squared: 0.726
Method: Least Squares F-statistic: 120.4
Date: Sat, 20 Jul 2019 Prob (F-statistic): 3.51e-14
Time: 01:14:39 Log-Likelihood: -6.6417
No. Observations: 45 AIC: 15.28
Df Residuals: 44 BIC: 17.09
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
teamdeaths 0.0390 0.004 10.975 0.000 0.032 0.046
Omnibus: 0.726 Durbin-Watson: 1.872
Prob(Omnibus): 0.695 Jarque-Bera (JB): 0.578
Skew: -0.270 Prob(JB): 0.749
Kurtosis: 2.870 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

baron.jpg

3. Win Percentage based on Large Monster Kills

We pick the multiple factors that influence "large monster kills":

- teamdragkills = Total dragons killed by team.
- herald = Rift herald taken (1 yes, 0 opponent took it, blank herald not killed).
- fbaron = First baron of game killed (1 yes, 0 no).
- teambaronkills = Total barons killed by team.
- heraldtime = Herald kill time, in minutes.
- fbarontime = First baron time, in minutes.

With these factors, we run a linear regression to fit the data.

  • herald, dragon and baron all refer to large monsters that reside in the jungle.
In [156]:
# largeMonsterKills is a sub-dataframe of team_data based on the "Large Monster Kills" factors.
largeMonsterKills = team_data[['teamdragkills','herald','heraldtime','fbaron','fbarontime','teambaronkills']]
# Set X as largeMonsterKills dataframe
X = largeMonsterKills
# y is already set as our win_percentage
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [157]:
# run the linear regression on the training set using statsmodel library
largeMonsterKillsLRM = smf.OLS(y_train, X_train).fit()

# Predict win percentage using our regression model on the Test data
preds_LargeMonsterKills = largeMonsterKillsLRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_LargeMonsterKills = cross_val_predict(model, X, y, cv=5)

# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Monster Kills')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_LargeMonsterKills, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_LargeMonsterKills, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()

Large Monster Kills Statistics

  • The R-squared value shows that about 90% of the variance is explained by the large-monster-kill factors.
  • Of the factors analyzed, teamdragkills (total dragons killed by team) is the most clearly statistically significant given its very low p-value; heraldtime and fbarontime also come in below the 0.05 threshold.
In [158]:
largeMonsterKillsLRM.summary()
Out[158]:
OLS Regression Results
Dep. Variable: result R-squared: 0.907
Model: OLS Adj. R-squared: 0.892
Method: Least Squares F-statistic: 63.09
Date: Sat, 20 Jul 2019 Prob (F-statistic): 1.53e-18
Time: 01:14:40 Log-Likelihood: 16.343
No. Observations: 45 AIC: -20.69
Df Residuals: 39 BIC: -9.846
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
teamdragkills 0.2258 0.023 9.869 0.000 0.179 0.272
herald -0.0424 0.207 -0.205 0.839 -0.461 0.376
heraldtime 0.0414 0.020 2.104 0.042 0.002 0.081
fbaron 0.1547 0.304 0.508 0.614 -0.461 0.770
fbarontime -0.0286 0.009 -3.236 0.002 -0.047 -0.011
teambaronkills 0.1982 0.210 0.943 0.352 -0.227 0.623
Omnibus: 1.580 Durbin-Watson: 1.601
Prob(Omnibus): 0.454 Jarque-Bera (JB): 0.740
Skew: -0.190 Prob(JB): 0.691
Kurtosis: 3.500 Cond. No. 318.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

towerkills.png

4. Win Percentage based on Tower Kills

The factors that influence Tower Kills:

tower kills
ft = First tower of game killed (1 yes, 0 no).
firstmidouter = First team to kill mid lane outer tower (1 yes, 0 no).
firsttothreetowers = First team to kill three towers (1 yes, 0 no).
teamtowerkills = Total towers killed by team.
fttime = First tower kill time, in minutes.

With these factors, we run a linear regression to fit the data.
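One caveat worth noting: the `train_test_split` calls in this notebook are unseeded, so the split (and therefore every R-squared and p-value reported below) changes slightly on each rerun. Passing `random_state` pins the shuffle. A self-contained sketch on synthetic data (the coefficients are made up for illustration):

```python
import numpy as np
from sklearn import linear_model, model_selection

rng = np.random.RandomState(42)
X = rng.randn(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)

# random_state pins the shuffle, so the train/test R^2 is reproducible
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=42)
lm = linear_model.LinearRegression().fit(X_train, y_train)
print(round(lm.score(X_test, y_test), 3))
```

The same `random_state=42` argument could be added to each split in this notebook to make the reported statistics stable across runs.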

In [159]:
# tower_kills is a sub-dataframe of team_data based on the "tower kills" factors.
tower_kills = team_data[['ft','firstmidouter','firsttothreetowers','teamtowerkills','fttime']]
# Set X as tower_kills dataframe, # y is already set as our win_percentage
X = tower_kills
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [160]:
# run the linear regression on the training set using statsmodel library
towerKillsLRM = smf.OLS(y_train, X_train).fit()

# Predict win percentage using our regression model on the Test data
preds_TowerKills = towerKillsLRM.predict(X_test)
# Predict win percentage using 5-fold cross validation over the full data set
KFCV_TowerKills = cross_val_predict(model, X, y, cv=5)

# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Tower Kills')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_TowerKills, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_TowerKills, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()

Tower Kills Statistics

  • The R-squared value shows that nearly 98% of the variance is explained by the tower-kill factors.
  • Of the factors analyzed (ft, firstmidouter, firsttothreetowers, teamtowerkills, fttime), teamtowerkills and fttime came out as statistically relevant because of their low p-values; firsttothreetowers also falls just below the 0.05 threshold.
In [161]:
towerKillsLRM.summary()
Out[161]:
OLS Regression Results
Dep. Variable: result R-squared: 0.979
Model: OLS Adj. R-squared: 0.977
Method: Least Squares F-statistic: 377.1
Date: Sat, 20 Jul 2019 Prob (F-statistic): 1.60e-32
Time: 01:14:41 Log-Likelihood: 49.236
No. Observations: 45 AIC: -88.47
Df Residuals: 40 BIC: -79.44
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
ft -0.0291 0.110 -0.265 0.793 -0.251 0.193
firstmidouter -0.1057 0.141 -0.750 0.457 -0.390 0.179
firsttothreetowers 0.2906 0.130 2.233 0.031 0.028 0.554
teamtowerkills 0.0921 0.004 24.439 0.000 0.085 0.100
fttime -0.0089 0.003 -2.738 0.009 -0.016 -0.002
Omnibus: 17.824 Durbin-Watson: 2.464
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.280
Skew: -1.176 Prob(JB): 1.19e-06
Kurtosis: 6.003 Cond. No. 182.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

creepscore.jpeg

5. Win Percentage based on Creep Score

We pick the factors that influence Creep Score:

creep score
minionkills = Lane minions killed.
monsterkills = Neutral monsters killed.
cspm = Creep score per minute. All creep score variables include minions and monsters.
csat10 = Creep score at 10:00.
csdat10 = Creep score difference at 10:00.

With these multiple factors, we run a linear regression to fit the data.

In [162]:
# minion_kills is a sub-dataframe of team_data based on the "creep score" factors.
minion_kills = team_data[['minionkills','monsterkills','cspm','csat10','csdat10']]
# Set X as minion_kills dataframe
# y is already set as our win_percentage
X = minion_kills
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [163]:
# run the linear regression on the training set using statsmodel library
creepScoreLRM = smf.OLS(y_train, X_train).fit()

# Predict win percentage using our regression model on the Test data
preds_CreepScore = creepScoreLRM.predict(X_test)
# Predict win percentage using 5-fold cross validation over the full data set
KFCV_CreepScore = cross_val_predict(model, X, y, cv=5)

# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Creep Score')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_CreepScore, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_CreepScore, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()

Creep Score Statistics

  • The R-squared value shows that only about 68% of the variance is explained by the creep-score factors.
  • Of the factors analyzed (minionkills, monsterkills, cspm, csat10, csdat10), none reached statistical significance at the 0.05 level; minionkills came closest (p ≈ 0.11).
In [164]:
creepScoreLRM.summary()
Out[164]:
OLS Regression Results
Dep. Variable: result R-squared: 0.675
Model: OLS Adj. R-squared: 0.635
Method: Least Squares F-statistic: 16.64
Date: Sat, 20 Jul 2019 Prob (F-statistic): 7.17e-09
Time: 01:14:42 Log-Likelihood: -13.206
No. Observations: 45 AIC: 36.41
Df Residuals: 40 BIC: 45.45
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
minionkills -0.0020 0.001 -1.654 0.106 -0.004 0.000
monsterkills 0.0060 0.005 1.248 0.219 -0.004 0.016
cspm 0.0828 0.065 1.266 0.213 -0.049 0.215
csat10 -0.0050 0.007 -0.727 0.471 -0.019 0.009
csdat10 0.0053 0.005 1.019 0.314 -0.005 0.016
Omnibus: 8.209 Durbin-Watson: 1.409
Prob(Omnibus): 0.016 Jarque-Bera (JB): 8.121
Skew: 1.040 Prob(JB): 0.0172
Kurtosis: 3.075 Cond. No. 1.07e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.07e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
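The large condition number flagged here most likely reflects multicollinearity among the creep-score variables (cspm is by definition built from the same minion and monster kills it sits alongside). A standard way to quantify this is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the others. A minimal numpy-only sketch on synthetic columns (names illustrative):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X (larger => more collinear)."""
    n, p = X.shape
    out = []
    for j in range(p):
        # regress column j on an intercept plus the remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - (resid ** 2).sum() / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.RandomState(0)
a = rng.randn(100)
b = a + 0.01 * rng.randn(100)   # nearly a copy of a -> severe collinearity
c = rng.randn(100)              # independent column
print([round(v, 1) for v in vif(np.column_stack([a, b, c]))])
```

The first two VIFs blow up while the independent column stays near 1; running the same check on the creep-score sub-dataframe would show which columns to drop or combine.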

wards.jpg

6. Win Percentage based on Vision (Warding)

  • Wards are small objects that players can plant on the map. They last a few minutes before burning out and provide your team vision of areas outside the base.
  • The above picture shows different custom ward skins (designs) that are available to players.

We pick the factors that influence Vision:

wards = Total wards placed (of all types).
wpm = Wards placed per minute (of all types).
wardkills = Total wards cleared/killed (of all types).
wcpm = Wards cleared/killed per minute (of all types).
visionwards = Vision/control wards placed.
visionwardbuys = Vision/control wards purchased.

With these multiple factors, we run a linear regression to fit the data.

In [165]:
# vision is a sub-dataframe of team_data based on the "vision" factors.
vision = team_data[['wards','wpm','wardkills','wcpm','visionwards','visionwardbuys']]
# Set X as vision dataframe, # y is already set as our win_percentage
X = vision
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [166]:
# Fit the Linear Regression on Train split
visionLRM = smf.OLS(y_train, X_train).fit()

# Predict using Test split
preds_Vision = visionLRM.predict(X_test)
# Predict win percentage using 5-fold cross validation over the full data set
KFCV_Vision = cross_val_predict(model, X, y, cv=5)

# Plot how the predicted win_ratio compares to actual win ratio
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Vision')
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
sns.distplot(preds_Vision, hist=False, label="Linear Regression Predictions", ax=ax)
sns.distplot(KFCV_Vision, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()

Vision Statistics

  • The R-squared value shows that only about 65% of the variance is explained by the vision factors.
  • Of the factors analyzed (wards, wpm, wardkills, wcpm, visionwards, visionwardbuys), none reached statistical significance at the 0.05 level; visionwards and visionwardbuys came closest (p ≈ 0.15–0.17).
In [167]:
visionLRM.summary()
Out[167]:
OLS Regression Results
Dep. Variable: result R-squared: 0.654
Model: OLS Adj. R-squared: 0.601
Method: Least Squares F-statistic: 12.31
Date: Sat, 20 Jul 2019 Prob (F-statistic): 9.99e-08
Time: 01:14:42 Log-Likelihood: -13.178
No. Observations: 45 AIC: 38.36
Df Residuals: 39 BIC: 49.20
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
wards 0.0005 0.036 0.012 0.990 -0.073 0.074
wpm -0.0246 1.292 -0.019 0.985 -2.637 2.588
wardkills -0.0286 0.073 -0.391 0.698 -0.176 0.119
wcpm 1.4634 2.657 0.551 0.585 -3.910 6.837
visionwards 0.0999 0.071 1.403 0.168 -0.044 0.244
visionwardbuys -0.0973 0.066 -1.466 0.151 -0.232 0.037
Omnibus: 6.681 Durbin-Watson: 1.736
Prob(Omnibus): 0.035 Jarque-Bera (JB): 6.204
Skew: 0.908 Prob(JB): 0.0450
Kurtosis: 3.094 Cond. No. 7.25e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.25e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

shop3.jpg

7. Win Percentage based on Economics

We pick the factors that influence Economics:

totalgold = Total gold earned from all sources.
earnedgpm = Earned gold per minute.
goldspent = Total gold spent.
gspd = Gold spent percentage difference.
goldat10 = Total gold earned at 10:00.
gdat10 = Gold difference at 10:00.
goldat15 = Total gold earned at 15:00.
gdat15 = Gold difference at 15:00.

With these multiple factors, we run a linear regression to fit the data.

In [168]:
# economics is a sub-dataframe of team_data based on the "economics" factors.
economics = team_data[['totalgold','earnedgpm','goldspent','gspd','goldat10','gdat10','goldat15','gdat15']]
# Set X as economics dataframe, # y is already set as our win_percentage
X = economics
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [169]:
# run the linear regression on the training set using statsmodel library
econLRM = smf.OLS(y_train, X_train).fit()

# Predict win percentage using our regression model on the Test data
preds_Econ = econLRM.predict(X_test)
# Predict win percentage using 5-fold cross validation over the full data set
KFCV_Econ = cross_val_predict(model, X, y, cv=5)

# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Economics')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_Econ, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_Econ, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()

Economics Statistics

  • The R-squared value shows that over 90% of the variance is explained by the economic factors.
  • Of the factors analyzed (totalgold, earnedgpm, goldspent, gspd, goldat10, gdat10, goldat15, gdat15), totalgold and goldspent came out as statistically relevant because of their low p-values; earnedgpm did not quite reach the 0.05 threshold.
In [170]:
econLRM.summary()
Out[170]:
OLS Regression Results
Dep. Variable: result R-squared: 0.923
Model: OLS Adj. R-squared: 0.906
Method: Least Squares F-statistic: 55.08
Date: Sat, 20 Jul 2019 Prob (F-statistic): 3.19e-18
Time: 01:14:43 Log-Likelihood: 20.467
No. Observations: 45 AIC: -24.93
Df Residuals: 37 BIC: -10.48
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
totalgold 8.797e-06 8.24e-07 10.679 0.000 7.13e-06 1.05e-05
earnedgpm 0.0013 0.001 1.565 0.126 -0.000 0.003
goldspent -1.838e-05 8.67e-06 -2.121 0.041 -3.59e-05 -8.19e-07
gspd 0.9057 1.565 0.579 0.566 -2.264 4.076
goldat10 -3.237e-05 0.000 -0.195 0.846 -0.000 0.000
gdat10 4.033e-05 0.000 0.280 0.781 -0.000 0.000
goldat15 -3.522e-07 0.000 -0.003 0.997 -0.000 0.000
gdat15 1.237e-05 8.03e-05 0.154 0.878 -0.000 0.000
Omnibus: 5.734 Durbin-Watson: 1.428
Prob(Omnibus): 0.057 Jarque-Bera (JB): 5.232
Skew: -0.476 Prob(JB): 0.0731
Kurtosis: 4.373 Cond. No. 4.72e+06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.72e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
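A side note on the table above: the gold coefficients look tiny (on the order of 1e-05) only because gold is measured in raw units of tens of thousands, not because the effect is small. Standardizing the features puts every coefficient on a comparable scale without changing the quality of the fit. A hedged sketch on synthetic gold-like and kills-like columns (scales are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 200
gold = 30000 + 5000 * rng.randn(n)   # gold-like scale: tens of thousands
kills = 20 + 5 * rng.randn(n)        # kills-like scale: tens
# both features matter equally once standardized
y = (gold - 30000) / 5000 + (kills - 20) / 5 + 0.1 * rng.randn(n)
X = np.column_stack([gold, kills])

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)
print(raw.coef_)     # wildly different magnitudes, hard to compare
print(scaled.coef_)  # both near 1.0 once on the same scale
```

Applying the same scaler to the economics sub-dataframe would make totalgold and goldspent directly comparable to earnedgpm in the summary table.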

intro.jpg

Win Percentage based on the Most Relevant Factors

We will now create a more comprehensive linear regression model based on the most statistically relevant factors we encountered in the seven aforementioned linear regressions.

Most relevant groups of factors:

- champion kills
- large monster kills
- tower kills
- creep score
- vision
- economics

Most relevant factors:

- kpm
- teamdeaths
- teamdragkills
- teamtowerkills
- fttime
- minionkills
- monsterkills
- visionwards
- visionwardbuys
- totalgold
- earnedgpm
- goldspent
In [171]:
# relevant is a sub-dataframe of team_data based on the most relevant factors.
relevant = team_data[['kpm', 'teamdeaths', 'teamdragkills', 'teamtowerkills', 'fttime', 'minionkills', 'monsterkills', 'visionwards', 'visionwardbuys', 'totalgold', 'earnedgpm', 'goldspent']]
# Set X as relevant dataframe, # y is already set as our win_percentage
X = relevant
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
In [172]:
# run the linear regression on the training set using statsmodel library
relevantLRM = smf.OLS(y_train, X_train).fit()

# Predict win percentage using our regression model on the Test data
preds_Relevant = relevantLRM.predict(X_test)
# Predict win percentage using 5-fold cross validation over the full data set
KFCV_Relevant = cross_val_predict(model, X, y, cv=5)

# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Most Relevant Factors')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_Relevant, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_Relevant, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
In [173]:
relevantLRM.summary()
Out[173]:
OLS Regression Results
Dep. Variable: result R-squared: 0.993
Model: OLS Adj. R-squared: 0.990
Method: Least Squares F-statistic: 375.5
Date: Sat, 20 Jul 2019 Prob (F-statistic): 1.16e-31
Time: 01:14:43 Log-Likelihood: 75.222
No. Observations: 45 AIC: -126.4
Df Residuals: 33 BIC: -104.8
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
kpm 0.2214 0.163 1.359 0.183 -0.110 0.553
teamdeaths -0.0214 0.006 -3.539 0.001 -0.034 -0.009
teamdragkills -0.0386 0.022 -1.772 0.086 -0.083 0.006
teamtowerkills 0.0910 0.014 6.517 0.000 0.063 0.119
fttime 0.0160 0.011 1.466 0.152 -0.006 0.038
minionkills -0.0002 0.001 -0.459 0.650 -0.001 0.001
monsterkills -0.0003 0.001 -0.269 0.790 -0.003 0.002
visionwards 0.0111 0.011 1.000 0.324 -0.011 0.034
visionwardbuys -0.0090 0.011 -0.801 0.429 -0.032 0.014
totalgold 3.898e-06 2.75e-06 1.420 0.165 -1.69e-06 9.48e-06
earnedgpm 0.0003 0.000 1.684 0.102 -6.38e-05 0.001
goldspent -7.264e-06 7.05e-06 -1.031 0.310 -2.16e-05 7.08e-06
Omnibus: 15.582 Durbin-Watson: 1.920
Prob(Omnibus): 0.000 Jarque-Bera (JB): 29.099
Skew: -0.884 Prob(JB): 4.80e-07
Kurtosis: 6.521 Cond. No. 1.53e+06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.53e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

heimer.jpg

Conclusion & Analysis

  • The model based on the most relevant factors explains about 99% of the variance in the training data.
  • Our accuracy score turned out very well, at about 92%.
  • Based on the final comprehensive linear regression model, the gameplay factors most relevant to win_percentage by p-value are teamdeaths (total number of times team members died) and teamtowerkills (number of towers destroyed by the team), with kpm (kills per minute) close behind.
  • So the statistically supported strategy is destroying towers as quickly as possible while keeping deaths to a minimum.

A follow-up analysis would be to account for time-relevant factors, such as accomplishing certain objectives within a given time frame (10 or 15 minutes). Another route would be to include factors regarding champion selection and/or item selection. Because each champion and item is unique in its own right, and because the data did not include these factors, we reserved this analysis for possible follow-up work.

The accuracy score of our final comprehensive linear regression model

In [174]:
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.xlabel("True Values of win percentage")
plt.ylabel("Predictions of win percentage")
print("Score:", model.score(X_test, y_test))
Score: 0.9159380722671929

The accuracy score of our final comprehensive linear regression model based on 5-fold cross validation

In [175]:
# Perform 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validated scores:", scores)

KFCV_predictions = cross_val_predict(model, X, y, cv=5)
plt.scatter(y, KFCV_predictions)

accuracy = metrics.r2_score(y, KFCV_predictions)
print("Cross-Predicted Accuracy:", accuracy)

plt.xlabel("True Values of win percentage")
plt.ylabel("Predictions of win percentage")
plt.show()
Cross-validated scores: [0.83183106 0.96973406 0.89237215 0.92811649 0.89840953]
Cross-Predicted Accuracy: 0.9235959952857985
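Note that the two numbers above measure slightly different things: `cross_val_score` reports an R² per held-out fold, while `cross_val_predict` pools the held-out predictions from all folds and the R² is computed once over that pooled set, so the two need not agree exactly. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import r2_score

rng = np.random.RandomState(1)
X = rng.randn(150, 2)
y = X @ np.array([2.0, -1.0]) + 0.2 * rng.randn(150)

model = LinearRegression()
fold_scores = cross_val_score(model, X, y, cv=5)            # R^2 per fold
pooled = r2_score(y, cross_val_predict(model, X, y, cv=5))  # R^2 over pooled predictions
print(round(fold_scores.mean(), 3), round(pooled, 3))
```

On well-behaved data the two agree closely, as they do for our model here.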

summer18.jpg

Validation using Summer 2018 Data

  • We will read in validation data from summer 2018
  • Our test and training data was based on summer 2019
  • As you will see below, the win-percentage predictions from our comprehensive linear regression model (built on the most relevant variables) fit the actual win percentages more closely than any of the models based on a single group of factors.
  • The validation data supports the accuracy of our comprehensive model.
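Beyond eyeballing the overlaid distributions, the out-of-season fit can be quantified with a single `r2_score` on the 2018 teams. Since we cannot rerun the notebook here, this sketch uses synthetic stand-ins for the two seasons (the shapes and coefficients are invented), but the pattern `fit on season A, score on season B` is exactly what the validation below performs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(7)

def season(n):
    # hypothetical stand-ins for per-team stats; both seasons share the
    # same underlying stat-to-winrate relationship
    X = rng.randn(n, 3)
    y = X @ np.array([1.5, -0.5, 2.0]) + 0.2 * rng.randn(n)
    return X, y

X_2019, y_2019 = season(120)   # training season
X_2018, y_2018 = season(40)    # held-out validation season

model = LinearRegression().fit(X_2019, y_2019)
print(round(r2_score(y_2018, model.predict(X_2018)), 3))
```

In the notebook, the equivalent would be `r2_score(win_percentage, preds_Relevant)` after the prediction cell below.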
In [183]:
# read in validation data from summer 2018
# our test and training data was based on summer 2019
dataSummer18 = pd.read_excel('./2018-worlds-match-data-OraclesElixir-2018-11-03.xlsx')
In [184]:
# tidy, preprocess the data
validation_data = dataSummer18.loc[dataSummer18['player'] == 'Team'].copy()
validation_data.fillna(0, inplace=True)
team_validation_data = validation_data.groupby(['team']).mean()
win_percentage = team_validation_data[['result']]
In [185]:
# Predict win percentage using all of our regression models on the Validation data
X_ChampKills = team_validation_data[['kpm','teamkills','a']]
preds_champKillsLRM = champ_kills_LRM.predict(X_ChampKills)

X_NotDying = team_validation_data[['teamdeaths']]
preds_NotDying = notDying_LRM.predict(X_NotDying)

X_LargeMonsterKills = team_validation_data[['teamdragkills','herald','heraldtime','fbaron','fbarontime','teambaronkills']]
preds_LargeMonsterKills = largeMonsterKillsLRM.predict(X_LargeMonsterKills)

X_TowerKills = team_validation_data[['ft','firstmidouter','firsttothreetowers','teamtowerkills','fttime']]
preds_TowerKills = towerKillsLRM.predict(X_TowerKills)

X_CreepScore = team_validation_data[['minionkills','monsterkills','cspm','csat10','csdat10']]
preds_CreepScore = creepScoreLRM.predict(X_CreepScore)

X_Vision = team_validation_data[['wards','wpm','wardkills','wcpm','visionwards','visionwardbuys']]
preds_Vision = visionLRM.predict(X_Vision)

X_Econ = team_validation_data[['totalgold','earnedgpm','goldspent','gspd','goldat10','gdat10','goldat15','gdat15']]
preds_Econ = econLRM.predict(X_Econ)

X_Relevant = team_validation_data[['kpm', 'teamdeaths', 'teamdragkills', 'teamtowerkills', 'fttime', 'minionkills', 'monsterkills', 'visionwards', 'visionwardbuys', 'totalgold', 'earnedgpm', 'goldspent']]
preds_Relevant = relevantLRM.predict(X_Relevant)
In [192]:
# Plot the predicted values for win_percentage of all linear regression models 
# against the actual values for win_percentage
f, ax = plt.subplots(figsize=(16,11))
plt.title('Data Distribution for Actual and Predicted')

# plot actual values for win_percentage
sns.distplot(win_percentage, hist=False, label="Actual", ax=ax)

# plot linear regression values based on Champion Kills
sns.distplot(preds_champKillsLRM, hist=False, label="Linear Regression Predictions based on Champion Kills", ax=ax)

# plot linear regression values based on Not Dying
sns.distplot(preds_NotDying, hist=False, label="Linear Regression Predictions based on Not Dying", ax=ax)

# plot linear regression values based on LargeMonsterKills
sns.distplot(preds_LargeMonsterKills, hist=False, label="Linear Regression Predictions based on LargeMonsterKills", ax=ax)

# plot linear regression values based on TowerKills
sns.distplot(preds_TowerKills, hist=False, label="Linear Regression Predictions based on TowerKills", ax=ax)

# plot linear regression values based on CreepScore
sns.distplot(preds_CreepScore, hist=False, label="Linear Regression Predictions based on CreepScore", ax=ax)

# plot linear regression values based on Vision
sns.distplot(preds_Vision, hist=False, label="Linear Regression Predictions based on Vision", ax=ax)

# plot linear regression values based on Economics
sns.distplot(preds_Econ, hist=False, label="Linear Regression Predictions based on Economics", ax=ax)

# plot linear regression values based on Most Relevant factors
sns.distplot(preds_Relevant, hist=False, label="Linear Regression Predictions based on Most Relevant Factors", ax=ax)

ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
In [ ]: